
    Quality Assessment of Linked Datasets using Probabilistic Approximation

    Full text link
    With the increasing application of Linked Open Data, assessing the quality of datasets by computing quality metrics becomes an issue of crucial importance. For large and evolving datasets, an exact, deterministic computation of the quality metrics is too time-consuming or expensive. We employ probabilistic techniques such as Reservoir Sampling, Bloom Filters and Clustering Coefficient estimation to implement a broad set of data quality metrics in an approximate but sufficiently accurate way. Our implementation is integrated into the comprehensive data quality assessment framework Luzzu. We evaluated its performance and accuracy on Linked Open Datasets of broad relevance. Comment: 15 pages, 2 figures; to appear in the ESWC 2015 proceedings.
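
    As a rough illustration of the kind of probabilistic approximation mentioned above (a sketch, not the authors' Luzzu implementation), the following Python snippet estimates a quality metric over a triple stream from a reservoir sample; the metric, the predicate and the triple generator are hypothetical.

        import random

        def reservoir_sample(stream, k, rng=random):
            """Keep a uniform random sample of k items from a stream of unknown length."""
            sample = []
            for i, item in enumerate(stream):
                if i < k:
                    sample.append(item)
                else:
                    j = rng.randint(0, i)   # replace an existing element with probability k/(i+1)
                    if j < k:
                        sample[j] = item
            return sample

        def estimate_metric(triples, is_good, sample_size=1000):
            """Approximate the fraction of triples satisfying a quality predicate."""
            sample = reservoir_sample(triples, sample_size)
            return sum(1 for t in sample if is_good(t)) / len(sample) if sample else 0.0

        # Hypothetical usage: estimate how many predicates come from a known vocabulary.
        triples = ((f"s{i}", "http://example.org/p", f"o{i}") for i in range(100000))
        print(estimate_metric(triples, lambda t: t[1].startswith("http://example.org/")))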

    Cache-oblivious index for approximate string matching

    Get PDF
    This paper revisits the problem of indexing a text for approximate string matching. Specifically, given a text T of length n and a positive integer k, we want to construct an index of T such that for any input pattern P, we can find all its k-error matches in T efficiently. This problem is well-studied in the internal-memory setting. Here, we extend some of these recent results to external-memory solutions, which are also cache-oblivious. Our first index occupies O((n log^k n)/B) disk pages and finds all k-error matches with O((|P| + occ)/B + log^k n · log log_B n) I/Os, where B denotes the number of words in a disk page. To the best of our knowledge, this index is the first external-memory data structure that does not require Ω(|P| + occ + poly(log n)) I/Os. The second index reduces the space to O((n log n)/B) disk pages, and the I/O complexity is O((|P| + occ)/B + log^{k(k+1)} n · log log n). © 2011 Elsevier B.V. All rights reserved.
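
    The index constructions themselves are involved; purely as a self-contained reminder of the problem they solve (not the paper's data structure), the Python sketch below reports every end position in T where some substring matches P with at most k edit errors, using the classic O(|P|·n)-time dynamic program.

        def k_error_end_positions(text, pattern, k):
            """Sellers' dynamic program: return end positions j in text such that
            some substring of text ending at j is within edit distance k of pattern."""
            m = len(pattern)
            prev = list(range(m + 1))        # column before any text character
            hits = []
            for j, c in enumerate(text):
                curr = [0] * (m + 1)         # a match may start at any text position
                for i in range(1, m + 1):
                    cost = 0 if pattern[i - 1] == c else 1
                    curr[i] = min(prev[i - 1] + cost,   # match / substitution
                                  prev[i] + 1,          # text character left unmatched
                                  curr[i - 1] + 1)      # pattern character left unmatched
                if curr[m] <= k:
                    hits.append(j)
                prev = curr
            return hits

        print(k_error_end_positions("external memory string matching", "strng", 1))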

    A Bulk-Parallel Priority Queue in External Memory with STXXL

    Get PDF
    We propose the design and an implementation of a bulk-parallel external memory priority queue that takes advantage of both shared-memory parallelism and the high transfer speeds of parallel external disks. To achieve higher performance by decoupling item insertions and extractions, we offer two parallelization interfaces: one using "bulk" sequences, the other defining "limit" items. In the design, we discuss how to parallelize insertions using multiple heaps, and how to calculate a dynamic prediction sequence in order to prefetch blocks and apply parallel multiway merging for extraction. Our experimental results show that, in the selected benchmarks, the priority queue reaches 75% of the full parallel I/O bandwidth of rotational disks and 65% of SSDs, or the speed of sorting in external memory when bounded by computation. Comment: extended version of the SEA'15 conference paper.
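
    The library itself is C++ built on STXXL; the toy Python sketch below only illustrates the "bulk" interface idea described above, where insertions are buffered into sorted runs and extraction multiway-merges the run heads. The class and method names are invented for illustration and do not mirror the STXXL API, and there is no external memory or parallelism here.

        import heapq

        class BulkPriorityQueue:
            """Toy in-memory analogue of a bulk-insertion priority queue: each bulk of
            insertions becomes one sorted run, and extraction multiway-merges the runs."""

            def __init__(self):
                self._runs = []          # list of sorted lists
                self._heads = []         # heap of (value, run index, offset within run)
                self._bulk = None

            def bulk_push_begin(self):
                self._bulk = []

            def bulk_push(self, item):
                self._bulk.append(item)              # no ordering work yet

            def bulk_push_end(self):
                self._bulk.sort()                    # one sort per bulk
                if self._bulk:
                    r = len(self._runs)
                    self._runs.append(self._bulk)
                    heapq.heappush(self._heads, (self._bulk[0], r, 0))
                self._bulk = None

            def pop_min(self):
                value, r, i = heapq.heappop(self._heads)
                run = self._runs[r]
                if i + 1 < len(run):                 # advance within the same run
                    heapq.heappush(self._heads, (run[i + 1], r, i + 1))
                return value

        pq = BulkPriorityQueue()
        pq.bulk_push_begin()
        for x in [5, 1, 9, 3]:
            pq.bulk_push(x)
        pq.bulk_push_end()
        print([pq.pop_min() for _ in range(4)])      # -> [1, 3, 5, 9]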

    Low Space External Memory Construction of the Succinct Permuted Longest Common Prefix Array

    Full text link
    The longest common prefix (LCP) array is a versatile auxiliary data structure in indexed string matching. It can be used to speed up searching using the suffix array (SA) and provides an implicit representation of the topology of an underlying suffix tree. The LCP array of a string of length n can be represented as an array of n words, or, in the presence of the SA, as a bit vector of 2n bits plus asymptotically negligible support data structures. External memory construction algorithms for the LCP array have been proposed, but those proposed so far have a space requirement of O(n) words (i.e. O(n log n) bits) in external memory. This space requirement is in some practical cases prohibitively expensive. We present an external memory algorithm for constructing the 2n-bit version of the LCP array which uses O(n log σ) bits of additional space in external memory when given a (compressed) BWT with alphabet size σ and a sampled inverse suffix array at sampling rate O(log n). This is often a significant space gain in practice, where σ is usually much smaller than n or even constant. We also consider the case of computing succinct LCP arrays for circular strings.
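
    As background on the object being constructed (not the paper's external-memory, BWT-based algorithm), the Python sketch below computes the LCP array from a text and its suffix array with Kasai's linear-time, internal-memory method; the 2n-bit representation discussed above encodes essentially the same values in permuted, unary form.

        def lcp_from_sa(text, sa):
            """Kasai's algorithm: lcp[i] is the length of the longest common prefix
            of the suffixes starting at sa[i-1] and sa[i] (with lcp[0] = 0)."""
            n = len(text)
            rank = [0] * n
            for i, s in enumerate(sa):
                rank[s] = i
            lcp = [0] * n
            h = 0
            for i in range(n):
                if rank[i] > 0:
                    j = sa[rank[i] - 1]
                    while i + h < n and j + h < n and text[i + h] == text[j + h]:
                        h += 1
                    lcp[rank[i]] = h
                    if h > 0:
                        h -= 1               # reuse work for the next suffix
                else:
                    h = 0
            return lcp

        text = "banana"
        sa = sorted(range(len(text)), key=lambda i: text[i:])   # naive SA, demo only
        print(lcp_from_sa(text, sa))                            # -> [0, 1, 3, 0, 0, 2]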

    Fading histograms in detecting distribution and concept changes

    Get PDF
    The remarkable number of real applications under dynamic scenarios is driving a novel ability to generate and gather information. Nowadays, a massive amount of information is generated at a high speed, in the form of data streams. Moreover, data are collected under evolving environments. Due to memory restrictions, data must be promptly processed and then discarded immediately. Therefore, dealing with evolving data streams raises two main questions: (i) how to remember discarded data? and (ii) how to forget outdated data? To maintain an updated representation of the time-evolving data, this paper proposes fading histograms. Regarding the dynamics of nature, changes in the data are detected through a windowing scheme that compares data distributions computed from the fading histograms: the adaptive cumulative windows model (ACWM). The online monitoring of the distance between data distributions is evaluated using a dissimilarity measure based on the asymmetry of the Kullback–Leibler divergence. The experimental results support the ability of fading histograms to provide an updated representation of the data. This property works in favor of detecting distribution changes with a smaller detection delay when compared with standard histograms. With respect to the detection of concept changes, the ACWM is compared with three known algorithms from the literature on artificial data and on public data sets, presenting better results. Furthermore, the proposed method was extended to multidimensional data, and the experiments performed show the ability of the ACWM to detect distribution changes in these settings.
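
    A minimal Python sketch of the exponential-forgetting idea behind fading histograms, together with a KL-based dissimilarity between two histograms, is given below; the bin layout, the fading factor and the symmetrised divergence are illustrative placeholders rather than the paper's ACWM or its asymmetry-based measure.

        import math

        class FadingHistogram:
            """Histogram with exponential forgetting: every update multiplies the old
            counts by a fading factor alpha < 1, so recent data weigh more."""

            def __init__(self, edges, alpha=0.999):
                self.edges = edges                        # ascending bin boundaries
                self.alpha = alpha
                self.counts = [0.0] * (len(edges) + 1)

            def update(self, x):
                self.counts = [c * self.alpha for c in self.counts]   # forget
                b = sum(1 for e in self.edges if x >= e)              # locate bin
                self.counts[b] += 1.0                                 # remember

            def probs(self, eps=1e-9):
                total = sum(self.counts) + eps * len(self.counts)
                return [(c + eps) / total for c in self.counts]

        def kl(p, q):
            return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

        def dissimilarity(h1, h2):
            """Symmetrised Kullback-Leibler divergence between two histograms."""
            p, q = h1.probs(), h2.probs()
            return 0.5 * (kl(p, q) + kl(q, p))

        # Hypothetical usage: compare a reference window against the current window.
        ref, cur = FadingHistogram([0.25, 0.5, 0.75]), FadingHistogram([0.25, 0.5, 0.75])
        for x in [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]:
            ref.update(x)
        for x in [0.7, 0.8, 0.85, 0.9, 0.95, 0.99]:
            cur.update(x)
        print(dissimilarity(ref, cur))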

    Refinement type contracts for verification of scientific investigative software

    Full text link
    Our scientific knowledge is increasingly built on software output. User code which defines data analysis pipelines and computational models is essential for research in the natural and social sciences, but little is known about how to ensure its correctness. The structure of this code and the development process used to build it limit the utility of traditional testing methodology. Formal methods for software verification have seen great success in ensuring code correctness but generally require more specialized training, development time, and funding than is available in the natural and social sciences. Here, we present a Python library which uses lightweight formal methods to provide correctness guarantees without the need for specialized knowledge or substantial time investment. Our package provides runtime verification of function entry and exit condition contracts using refinement types. It allows checking hyperproperties within contracts and offers automated test case generation to supplement online checking. We co-developed our tool with a medium-sized (≈3000 LOC) software package which simulates decision-making in cognitive neuroscience. In addition to helping us locate trivial bugs earlier in the development cycle, our tool was able to locate four bugs which may have been difficult to find using traditional testing methods. It was also able to find bugs in user code which did not contain contracts or refinement type annotations. This demonstrates how formal methods can be used to verify the correctness of scientific software which is difficult to test with mainstream approaches.
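
    The abstract does not expose the library's API, so the snippet below is only a generic Python stand-in for the idea of runtime entry/exit contracts: a decorator that checks a precondition on the arguments and a postcondition on the return value. The decorator, its parameters and the example function are hypothetical and not the paper's tool.

        import functools

        def contract(pre=None, post=None):
            """Minimal runtime contract decorator: `pre` checks the arguments on entry,
            `post` checks the return value on exit."""
            def wrap(fn):
                @functools.wraps(fn)
                def checked(*args, **kwargs):
                    if pre is not None:
                        assert pre(*args, **kwargs), f"precondition of {fn.__name__} violated"
                    result = fn(*args, **kwargs)
                    if post is not None:
                        assert post(result), f"postcondition of {fn.__name__} violated"
                    return result
                return checked
            return wrap

        # Hypothetical usage: a probability must come back inside the unit interval.
        @contract(pre=lambda evidence: all(x >= 0 for x in evidence),
                  post=lambda p: 0.0 <= p <= 1.0)
        def posterior(evidence):
            return sum(evidence) / (sum(evidence) + 1.0)

        print(posterior([0.2, 0.4]))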

    Hybrid Statistical Estimation of Mutual Information for Quantifying Information Flow

    Get PDF
    Analysis of a probabilistic system often requires learning the joint probability distribution of its random variables. Computing the exact distribution usually requires an exhaustive, precise analysis of all executions of the system. To avoid the high computational cost of such an exhaustive search, statistical analysis has been studied to efficiently obtain approximate estimates by analyzing only a small but representative subset of the system's behavior. In this paper we propose a hybrid statistical estimation method that combines precise and statistical analyses to estimate mutual information and its confidence interval. We show how to combine analyses of different components of the system, carried out with different precision, to obtain an estimate for the whole system. The new method performs weighted statistical analysis with different sample sizes over different components and dynamically finds their optimal sample sizes. Moreover, it can reduce sample sizes by using prior knowledge about systems and a new abstraction-then-sampling technique based on qualitative analysis. We show that the new method outperforms the state of the art in quantifying information leakage.
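
    As a small Python example of the purely statistical end of the spectrum the paper hybridises, the snippet below computes a plug-in estimate of mutual information from sampled input/output pairs; the component weighting, confidence intervals and abstraction-then-sampling of the paper are not shown, and the leaking system is hypothetical.

        import math
        import random
        from collections import Counter

        def mutual_information(pairs):
            """Plug-in estimate of I(X;Y) in bits from a list of (x, y) samples."""
            n = len(pairs)
            joint = Counter(pairs)
            px = Counter(x for x, _ in pairs)
            py = Counter(y for _, y in pairs)
            mi = 0.0
            for (x, y), c in joint.items():
                mi += (c / n) * math.log2(c * n / (px[x] * py[y]))
            return mi

        # Hypothetical system: the observable output leaks the parity of a secret.
        samples = []
        for _ in range(10000):
            secret = random.randrange(8)
            samples.append((secret, secret % 2))   # deterministic one-bit leak
        print(mutual_information(samples))         # close to 1.0 bit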

    Cache-Oblivious Model

    No full text